Don’t look as good
Hard to build more complex plots, and fine-tune
ggplot2
What is a statistical graphic?
Take variables from a dataset and map them to aes()thetic attributes of geom_etric objects
How are variables mapped to aesthetic attributes of points?
Construct a graphic by adding modular pieces
ggplot(data, mapping)
Define aesthetic mappings with the aes() function
aes(x = var1, y = var2)
Add ‘layers’ of geometric objects
geom_point()
Adjust axis scales, colors, labels, and other aesthetic details
“Chain” ggplot components together with + rather than %>%
Using + rather than %>% is unfortunate and hard to remember!
The key is to understand the concepts and basic mechanics
The details for any given plot type or attribute are easy to look up
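The modular pieces listed above can be put together in a short sketch. It uses the mtcars dataset that ships with R instead of gapminder, purely so the block runs on its own; the column choices (wt, mpg) and the axis labels are illustrative assumptions, not part of the original slides.

```r
library(ggplot2)

# 1. Data and aesthetic mappings go into ggplot().
# 2. Each further component is added with `+`, not `%>%`.
p <- ggplot(mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +        # a layer of geometric objects
  scale_x_log10() +     # an adjustment to the x-axis scale
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")  # custom labels

p  # printing the object draws the plot
```

Because the plot is an ordinary R object, further components can be added to `p` later, for example `p + theme_minimal()`.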
gap_92 <- gapminder %>%
  filter(year == 1992) %>%
  mutate(gdp = gdpPercap * pop / 1e9)
gap_92 %>% head(4)
# A tibble: 4 × 7
  country     continent  year lifeExp      pop gdpPercap    gdp
  <chr>       <chr>     <int>   <dbl>    <int>     <dbl>  <dbl>
1 Afghanistan Asia       1992    41.7 16317921      649.  10.6
2 Albania     Europe     1992    71.6  3326498     2497.   8.31
3 Algeria     Africa     1992    67.7 26298373     5023. 132.
4 Angola      Africa     1992    40.6  8735988     2628.  23.0
Change how data values are translated to visual properties
scale_x_log10(), scale_y_reverse()
Change limits of axes:
xlim(0, 10)
Applies to other attributes as well
ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp, shape = continent)) +
  geom_point() +
  scale_x_log10()
ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10()
The labs() function adds custom axis labels and titles
ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp)) +
  geom_point() +
  scale_x_log10() +
  labs(x = 'Gross Domestic Product (Billions $)',
       y = 'Life Expectancy at birth (years)',
       title = 'Gapminder for 1992')
Comparing 2 continuous variables
Scatterplot: geom_point()
Line graph: geom_line()
Smoothing functions: geom_smooth()
Summarizing distribution of a single variable
Histogram: geom_histogram()
Density: geom_density()
Discrete vs continuous
Boxplot: geom_boxplot()
Bar graph: geom_col()
Violin plot: geom_violin()
And many more…
df <- gapminder %>%
  filter(country == 'Romania')
ggplot(df, mapping = aes(x = year, y = lifeExp)) +
  geom_line()
We can add as many geoms to a plot as we want, stacked on as ‘layers’ in order
What if we had multiple data points per year?
df <- gapminder %>%
  filter(country %in% c('Romania', 'Thailand'))
ggplot(df, mapping = aes(x = year, y = lifeExp)) +
  geom_line() +
  geom_point()
Need to separate them by country (group aesthetic)
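A minimal sketch of the group aesthetic follows. It uses the built-in Orange dataset (repeated circumference measurements for five trees) as a stand-in for the two-country data, which is an assumption made only so the block runs without gapminder. Without `group`, geom_line() would connect all measurements into a single zig-zag path; with it, each tree gets its own line.

```r
library(ggplot2)

# Orange has several measurements per tree at each age,
# analogous to several countries per year.
p_grouped <- ggplot(Orange,
                    mapping = aes(x = age, y = circumference, group = Tree)) +
  geom_line() +   # one line per tree because of group = Tree
  geom_point()

p_grouped
```

With the slide's own data the equivalent mapping would be `group = country`.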
It is often useful to color lines by group: map the color aesthetic to a categorical variable and ggplot groups by it automatically
Aesthetics set in ggplot() apply to every layer, but you can override this for individual ‘geoms’
ggplot(df, mapping = aes(x = year, y = lifeExp)) +
  geom_line(mapping = aes(color = country)) +
  geom_point()
ggplot(df, mapping = aes(x = year, y = lifeExp, color = country)) +
  geom_line(linetype = 'dashed', size = 0.5) +
  geom_point(color = 'black', size = 3, alpha = 0.75)
How to depict the ‘average’ relationship between noisy variables?
ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp)) +
  geom_point() +
  scale_x_log10() +
  labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)')
geom_line() doesn’t work!
ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp)) +
  geom_line() +
  geom_point() +
  scale_x_log10() +
  labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)')
geom_smooth() shows the average (‘smoothed’) relationship
ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp)) +
  geom_point() +
  geom_smooth() +
  scale_x_log10() +
  labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)')
Can be used to show a linear trendline
ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  scale_x_log10() +
  labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)')
Can be very helpful to condense down relationships from complicated data
ggplot(gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10()
Can be very helpful to condense down relationships from complicated data
ggplot(gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_smooth(method = 'lm') +
  scale_x_log10()
The examples above all plot 2 continuous variables (other ‘aesthetics’ can encode additional variables)
Other common scenarios are:
Plot distribution of a single variable (continuous or discrete)
Plot the distribution of a continuous variable against a discrete variable
Given a single discrete variable we can plot its distribution as a ‘bar plot’ using geom_bar()
For a single continuous variable, we can generate a histogram using geom_histogram(), which bins the values and then makes a bar plot
We can adjust the axis scale and other features as usual
We can change the number of bins (can also specify details of bin positions)
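The bar plot and histogram described above can be sketched as follows. The block uses the mpg dataset bundled with ggplot2 rather than gapminder, and the variables `class` and `hwy` are illustrative choices, so treat it as a sketch under those assumptions.

```r
library(ggplot2)

# Discrete variable: geom_bar() counts the rows in each category.
p_bar <- ggplot(mpg, mapping = aes(x = class)) +
  geom_bar()

# Continuous variable: geom_histogram() bins the values, then draws bars.
# `bins` sets how many bins to use.
p_hist <- ggplot(mpg, mapping = aes(x = hwy)) +
  geom_histogram(bins = 20)

# Alternatively, fix the bin width and where the bin edges fall.
p_hist_width <- ggplot(mpg, mapping = aes(x = hwy)) +
  geom_histogram(binwidth = 2, boundary = 0)

p_bar
p_hist
p_hist_width
```

Specifying `bins` or `binwidth` explicitly also silences the message ggplot2 prints when it falls back to its default of 30 bins.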
Can also encode different continents in different colors by stacking the histograms
ggplot(gapminder, mapping = aes(x = gdpPercap, color = continent)) +
  geom_histogram() +
  scale_x_log10()
ggplot(gapminder, mapping = aes(x = gdpPercap, fill = continent)) +
  geom_histogram() +
  scale_x_log10()
Density plots are another way to depict the distribution of a continuous variable. They are just a smoothed histogram
Separate by continent and give separate fill colors
ggplot(gapminder, mapping = aes(x = gdpPercap, fill = continent)) +
  geom_density(alpha = 0.5) +
  scale_x_log10()
The boxplot is the most common choice for showing the distribution of a continuous variable broken down by a categorical variable
The violin plot is similar, but shows the distribution as a density plot, rather than a box.
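To make the boxplot and violin comparison concrete, here is a small sketch. It again relies on the mpg dataset bundled with ggplot2 (an assumption, chosen so the block is self-contained), with `class` as the categorical variable and `hwy` as the continuous one.

```r
library(ggplot2)

# Shared base: categorical variable on x, continuous variable on y.
base <- ggplot(mpg, mapping = aes(x = class, y = hwy))

# Boxplot: median, quartiles, whiskers and outliers per category.
p_box <- base + geom_boxplot()

# Violin plot: a mirrored density estimate per category.
p_violin <- base + geom_violin()

p_box
p_violin
```

The two geoms can also be layered in one plot, for example `base + geom_violin() + geom_boxplot(width = 0.1)`.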
Another useful option is a ‘dotplot’ or ‘beeswarm’ plot.
library(ggbeeswarm)
ggplot(gapminder, mapping = aes(x = continent, y = gdpPercap)) +
  geom_beeswarm(size = 0.5, alpha = 0.75, cex = 1) +
  scale_y_log10()
By default, x-axis values are ordered alphabetically
To change the order, we need to use a factor
Factors encode categorical variables: they specify the possible ‘levels’ and, optionally, an ordering
cont_order <- c('Oceania', 'Europe', 'Americas', 'Asia', 'Africa')
gap_cat <- gapminder %>%
  mutate(continent = factor(continent, levels = cont_order))
head(gap_cat)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <chr>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.
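How the `levels` argument controls ordering is easiest to see on a tiny vector. The sketch below uses only base R; the `sizes` values are made up for illustration.

```r
# A character vector of categories (illustrative values).
sizes <- c("medium", "small", "large", "small")

# Without `levels`, the levels are sorted alphabetically.
f_default <- factor(sizes)
levels(f_default)
# "large" "medium" "small"

# With `levels`, we choose the order ourselves.
f_custom <- factor(sizes, levels = c("small", "medium", "large"))
levels(f_custom)
# "small" "medium" "large"

# Internally each value is stored as the integer position of its level.
as.integer(f_custom)
# 2 1 3 1
```

ggplot2 places discrete axis values in level order, which is why converting `continent` to a factor changes the order on the x-axis.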
The forcats package has lots of useful helper functions for changing the order of factor levels.
gap_cat <- gap_cat %>%
  mutate(continent = fct_reorder(continent, gdpPercap, median))
ggplot(gap_cat, mapping = aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  scale_y_log10()
If you want to plot a single value of a continuous variable for each category, use geom_col()
gap_82 <- gapminder %>%
  filter(year == 1982, continent == 'Americas')
ggplot(gap_82, mapping = aes(x = country, y = gdpPercap)) +
  geom_col()
You can customize MANY details of the plot using the theme() function
It’s a bit complicated at first, but most common changes are easy to google.
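As a rough illustration of what theme() calls look like, the sketch below tweaks a few common elements. It uses the mpg dataset bundled with ggplot2, and the specific settings (rotated axis text, hidden minor grid lines, bold title) are arbitrary examples rather than recommendations from the slides.

```r
library(ggplot2)

p_themed <- ggplot(mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  labs(title = "Highway mileage by vehicle class") +
  theme(
    # rotate x-axis labels so long category names do not overlap
    axis.text.x = element_text(angle = 45, hjust = 1),
    # drop the minor grid lines
    panel.grid.minor = element_blank(),
    # make the title bold
    plot.title = element_text(face = "bold")
  )

p_themed
```

Each argument names one plot element, and its value is built with a helper such as element_text(), element_line(), or element_blank().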
ggsave
ggplot(gapminder, mapping = aes(x = continent, y = gdpPercap)) +
  geom_violin() +
  scale_y_log10()
ggsave(filename = here::here('results', 'my_fig.png'))
You don’t need to remember the details, just the basic mechanics. You can quickly look up the details (check out this useful ggplot cheat sheet)
Find example plots online that you like and just copy/paste as a template. Browse the ggplot gallery
If we map a continuous variable to color it won’t group automatically
ggplot(df, mapping = aes(x = year, y = lifeExp, color = gdpPercap)) +
  geom_line() +
  geom_point(size = 3)
We need to specify group manually
ggplot(df, mapping = aes(x = year, y = lifeExp,
                         group = country, color = gdpPercap)) +
  geom_line() +
  geom_point(size = 3)
ggplot assumes a continuous color map for numeric data and a discrete map for strings
Make numeric data into factors if you want discrete colors
my_df <- gapminder %>%
  filter(year %in% c(1957, 1977, 1997))
ggplot(my_df, mapping = aes(x = gdpPercap, y = lifeExp, color = factor(year))) +
  geom_point() +
  scale_x_log10() +
  labs(color = 'year')
We can use scale_color_manual() to set the color of each group manually
my_cols <- c(Romania = 'green', Thailand = 'orange')
ggplot(df, mapping = aes(x = year, y = lifeExp, color = country)) +
  geom_line() +
  scale_color_manual(values = my_cols)
scale_color_brewer() offers some useful default color schemes
ggplot(df, mapping = aes(x = year, y = lifeExp, color = country)) +
  geom_line() +
  scale_color_brewer(palette = 'Dark2')
https://www.r-bloggers.com/a-detailed-guide-to-ggplot-colors/
Facets allow you to easily break a single plot into multiple plots based on a variable.
ggplot(gap_early, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  scale_x_log10() +
  facet_wrap(~continent)
Or based on multiple variables
ggplot(gap_early, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  scale_x_log10() +
  facet_grid(year ~ continent)
gap_df <- gapminder %>%
  filter(year == 1992, continent == 'Americas') %>%
  mutate(gdp = gdpPercap * pop / 1e9) %>%
  head(20)
You can add text labels to the points with geom_text()
ggplot(gap_df, mapping = aes(x = gdp, y = lifeExp, label = country)) +
  geom_text() +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  scale_x_log10() +
  labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)')
Or with geom_label()
ggplot(gap_df, mapping = aes(x = gdp, y = lifeExp, label = country)) +
  geom_label() +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  scale_x_log10() +
  labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)')
Text labels are often not placed optimally
ggrepel is a very useful package that will automatically find good positioning for labels
library(ggrepel)
ggplot(gap_df, mapping = aes(x = gdp, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  scale_x_log10() +
  labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)') +
  geom_label_repel(aes(label = country), size = 2.5)
There are lots of ways to add aesthetic improvements to your figures relatively easily
There are a number of pre-packaged ‘themes’ you can apply
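A brief sketch of applying pre-packaged themes follows. It uses the mtcars dataset that ships with R (an assumption for self-containment); theme_bw() and theme_classic() are two of the complete themes included in ggplot2, picked here only as examples.

```r
library(ggplot2)

p_base <- ggplot(mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

# A complete theme is added like any other component.
p_bw <- p_base + theme_bw()

# base_size scales all text in the theme at once.
p_classic <- p_base + theme_classic(base_size = 14)

p_bw
p_classic
```

Further theme() tweaks can be added after a complete theme; placing them before it would let the complete theme overwrite them.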
Set the marker shape to one that can be ‘filled’ (pch = 21 is a filled circle), then use a thin white border around a filled shape to help distinguish overlaps.
ggplot(gap_92, aes(gdp, lifeExp)) +
  geom_point(pch = 21, stroke = 0.5, alpha = 0.8, size = 2.5, color = 'white', aes(fill = continent)) +
  scale_x_log10() +
  labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)', title = 'Gapminder for 1992') +
  theme_minimal()
Add stats directly to your figures
library(ggpubr)
my_comparisons <- list(c('Africa', 'Asia'), c('Europe', 'Oceania'))
ggplot(gapminder, mapping = aes(x = continent, y = gdpPercap)) +
  geom_violin() +
  scale_y_log10() +
  stat_compare_means(method = 'wilcox.test', comparisons = my_comparisons)
Easily add correlation coefficients
ggplot(gap_92, mapping = aes(x = lifeExp, y = gdpPercap)) +
  geom_point() +
  scale_y_log10() +
  geom_smooth(method = 'lm') +
  stat_cor()
Great tool for combining multiple ‘panels’ into one plot
library(cowplot)
p1 <- ggplot(mtcars, aes(disp, mpg)) +
  geom_point()
p2 <- ggplot(mtcars, aes(qsec, mpg)) +
  geom_point()
plot_grid(p1, p2, labels = c('A', 'B'))
ggplot2 struggles to make large heatmaps (geom_tile); for these, ComplexHeatmap is the preferred tool
See VERY detailed documentation with examples here
Also contains useful information on the basics of hierarchical clustering
aes() function
+ rather than %>%
theme() layer
stat_ layer